The Data Set contains information about Prosper loans. It contains variables related to: the amount, term and interest rate of the loans; the credit rating, employment status, and occupation of the borrowers; and information about lenders, investors, yields and more.
Prosper is an intermediary of peer-to-peer lending in the US.
In this financing model there are three major players:
The borrower, who submits an application to borrow up to $35,000 for a disclosed purpose and provides documentation to prove their credit rating.
The investor or lender who reviews the borrowing listings with information about the borrower, location, amount of loan, credit rating, fees, verification stage, percentage financed, and estimated return.
The intermediary, in this case Prosper Marketplace, Inc., which verifies the borrower’s documentation, assigns its own credit rating, and, for a servicing fee, handles lending and collecting the money and pays back to the investor the amounts repaid by the borrower. In the process, the intermediary keeps a percentage, charges fees, and charges an upfront amount deducted from what the borrower receives. The loans are unsecured, so the investor risks a lower return or even a loss on the transaction.
In this section I am interested in exploring individual variables using histograms, bar charts, summaries, measures of central tendency, and density plots to closely study the distribution of each variable.
After reviewing the long list of 81 variables in this dataset, the ones that caught my attention are: the amount, term, status, purpose and interest rates of the loan, the credit scores, the investors that fund the loan, and some more information about the borrower, like income range and debt-to-income ratio.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Half of the loans were for amounts from $1000–$6500. The amounts borrowed tend to peak around whole thousands, like $10000, $15000, $20000.
##
## 12 36 60
## 1614 87778 24545
##
## 12 36 60
## 1.416572 77.040821 21.542607
Most of the loans are for 36 months (77.0%) and 60 months (21.5%).
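The percentages above follow directly from the counts. A minimal sketch, using only the counts printed in the table (on the full data the same result would come from something like `prop.table(table(loans$Term))`, where `loans` is a hypothetical name for the data frame):

```r
# Counts per loan term (in months), copied from the table above
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

# Share of each term as a percentage of all loans
term_pct <- 100 * term_counts / sum(term_counts)
round(term_pct, 1)
```

The counts add up to 113,937, the number of observations in the dataset.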
## [1] "Summaries of BorrowerRate, BorrowerAPR, LenderYield and EstimatedEffectiveYield"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.116 0.162 0.169 0.224 0.320 29084
These interest rates are related to each other and seem to follow essentially the same distribution. The BorrowerAPR should reflect the effective cost of borrowing: the BorrowerRate plus other fees and costs. The LenderYield is what the borrower pays minus a 1% service fee that goes to Prosper. At first glance, the median values match this expectation:
LenderYield (17%) = Borrower Rate (18%) - Prosper (1%)
There are some outliers at the extremes: a minimum LenderYield of -0.01 (that is, -1%) and a maximum BorrowerAPR of about 0.51 (51%). These values could be errors in the data or rare cases.
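The relationship between the three rates can be checked with simple arithmetic on the medians. This is only a rough sketch: the medians are copied from the summaries above, and a difference of medians is not necessarily the median of the per-loan differences.

```r
# Medians copied from the summaries above (fractions, not percentages)
med <- c(BorrowerAPR = 0.2098, BorrowerRate = 0.1840, LenderYield = 0.1730)

fee_gap <- med[["BorrowerRate"]] - med[["LenderYield"]]  # servicing fee
apr_gap <- med[["BorrowerAPR"]] - med[["BorrowerRate"]]  # other fees and costs

# Express both gaps in percentage points
round(100 * c(fee_gap = fee_gap, apr_gap = apr_gap), 2)
```

The gap between BorrowerRate and LenderYield comes out at about 1 percentage point, in line with the servicing fee.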
The CreditGrade rating was used before July 2009. The other ratings are related to each other and are the ones used after July 2009. In fact, the alpha and numeric ratings are just two codings of the same rating, running from worst (1 = HR, High Risk) to best (7 = AA).
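The equivalence between the alpha and numeric ratings can be sketched with an ordered factor. The example grades below are illustrative values, not rows from the dataset:

```r
# Letter grades ordered from worst (HR) to best (AA), as described above
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")

# A few example grades (illustrative only)
alpha <- c("AA", "C", "HR", "B")

rating <- factor(alpha, levels = rating_levels, ordered = TRUE)
as.integer(rating)  # numeric coding: HR = 1, ..., AA = 7
```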
##
## Cancelled Chargedoff Completed
## 0.00438839 10.52511476 33.41671274
## Current Defaulted FinalPaymentInProgress
## 49.65551138 4.40418828 0.17992399
## Past Due (1-15 days) Past Due (>120 days) Past Due (16-30 days)
## 0.70740848 0.01404285 0.23258467
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0.31859712 0.27471322 0.26681412
49.6% of the loans in the dataset are current, 4.4% are defaulted, and about 1.8% are past due (2.0% if loans with a final payment in progress are included). The remaining 44% are either completed, charged off or cancelled.
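The grouped shares can be recomputed from the table. A small sketch using the percentages printed above:

```r
# Percentages copied from the LoanStatus table above
status_pct <- c(
  "Cancelled" = 0.00438839, "Chargedoff" = 10.52511476,
  "Completed" = 33.41671274, "Current" = 49.65551138,
  "Defaulted" = 4.40418828, "FinalPaymentInProgress" = 0.17992399,
  "Past Due (1-15 days)" = 0.70740848, "Past Due (>120 days)" = 0.01404285,
  "Past Due (16-30 days)" = 0.23258467, "Past Due (31-60 days)" = 0.31859712,
  "Past Due (61-90 days)" = 0.27471322, "Past Due (91-120 days)" = 0.26681412
)

# Sum every "Past Due" bucket, and the three closed-loan statuses
past_due <- sum(status_pct[grepl("^Past Due", names(status_pct))])
closed <- sum(status_pct[c("Cancelled", "Chargedoff", "Completed")])
round(c(past_due = past_due, closed = closed), 1)
```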
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
## 95% 99%
## 292 466
## [1] 0.2441174 1.0000000
## [1] 20000
24.4% of the loans are funded by 1 investor. The distribution has a very long tail, with a maximum of 1189 investors for a $20,000 loan.
## 5% 95%
## 52.928 634.040
## 5% 95%
## 0.06 0.50
More information about the borrower:
Most monthly payments are between $50 and $650.
The IncomeRange variable levels need reordering. Most borrowers have an income from 25K to 100K
The debt-to-income ratio ranges from 0.06 to 0.50 for most borrowers (5th to 95th percentile).
Most people borrowed money for Debt Consolidation.
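The IncomeRange reordering mentioned above could look like the sketch below. The level labels are an assumption about how the dataset spells them, so check `levels()` on the real column first. The example values are illustrative only.

```r
# Assumed level labels, ordered from no information to highest income
income_order <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                  "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                  "$100,000+")

# Illustrative values only
income <- factor(c("$50,000-74,999", "$100,000+", "$1-24,999",
                   "$25,000-49,999"))
levels(income)  # default order is alphabetical, not by income

# Rebuild the factor with an explicit, meaningful level order
income <- factor(income, levels = income_order, ordered = TRUE)
levels(income)
```

With ordered levels, bar charts and tables list the income brackets from lowest to highest instead of alphabetically.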
The dataset contains 113,937 observations. Each observation represents a loan described by 81 variables.
There is a group of variables related to the loans (Term, LoanOriginalAmount, LoanStatus); another group about the borrower (BorrowerRate, Occupation, EmploymentStatus, IsBorrowerHomeowner, OpenCreditLines); another about the yields and interest rates (LenderYield, EstimatedEffectiveYield); another about groups of borrowers (CurrentlyInGroup, GroupKey); and a group of variables about the credit score (CreditGrade, ProsperScore, ProsperRating).
Almost half of the loans are current, and half are either completed, charged off, or closed.
Among all the variables, the ones that catch my attention are the interest rates and the credit ratings. Is there a relationship between the interest a borrower gets charged and their credit rating?
LoanOriginalAmount: Is the interest rate related to the amount borrowed? Investors: Do investors prefer better credit ratings or higher interest returns? Term: Does the term have any bearing on the interest rate?
To get a better look at the Loan Listing Category, I created a variable ListingCategory with the data from ListingCategory..numeric. and recoded the factor levels according to the variables’ documentation.
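A recoding of this kind can be sketched as follows. The code-to-label map below is partial and assumed, so the full list should be taken from the variable definitions. The codes are illustrative only.

```r
# Partial, assumed code-to-label map (see the variable definitions for the full list)
category_labels <- c("0" = "Not Available", "1" = "Debt Consolidation",
                     "2" = "Home Improvement", "3" = "Business")

codes <- c(1, 1, 3, 0, 2, 1)  # illustrative codes only

# Turn numeric codes into a labelled factor
listing_category <- factor(codes,
                           levels = as.integer(names(category_labels)),
                           labels = unname(category_labels))
table(listing_category)
```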
The LoanOriginalAmount has a multi-modal distribution.
The Investors variable, which contains the number of investors that funded the loan, has a long-tailed distribution. After applying a log transformation, the distribution looks uniform, aside from the fact that 24% of the loans are funded by 1 investor.
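The effect of the log transformation can be sketched on a toy vector. The values below are illustrative only (a block of single-investor loans plus a long tail); they are not the dataset.

```r
# Illustrative investor counts: many single-investor loans plus a long tail
investors <- c(rep(1, 25), 2, 5, 12, 30, 44, 80, 115, 292, 466, 1189)

share_single <- mean(investors == 1)  # fraction funded by one investor
log_investors <- log10(investors)     # 1 investor maps to 0, 1000 to 3

# Bin the transformed values without drawing (plot = FALSE)
h <- hist(log_investors, breaks = seq(0, 3.5, by = 0.5), plot = FALSE)
h$counts
```

On the log scale the long tail is compressed into a few bins, while the single-investor loans all pile up at 0.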
In this section I want to delve into the relationships between the two main features that interest me and other variables in the dataset that seem to be related to them. Briefly, they are:
BorrowerRate, The Borrower’s interest rate for a loan, as related to these variables: credit ratings, loan amount, other interest rates, investors.
ProsperRating(numeric) The Prosper Rating assigned at the time the listing was created: (worst) 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA (best). Applicable for loans originated after July 2009. For this reason I will use a subset of the data with observations that use this rating. I will look into the relationships with other credit scores, the score changes of the borrower, the delinquency history, the inquiries history, the total available credit and utilization ratio, and debt to income ratio.
To judge each relationship I will use scatterplots and linear regression fits. In the process I will try to identify the main features for a linear regression model that explains what influences one’s credit rating.
##
## Calls:
## m.rate: lm(formula = BorrowerRate ~ rating, data = pl.rated)
##
## ===========================
## (Intercept) 0.317***
## (0.000)
## rating: 2/1 -0.024***
## (0.000)
## rating: 3/1 -0.071***
## (0.000)
## rating: 4/1 -0.123***
## (0.000)
## rating: 5/1 -0.163***
## (0.000)
## rating: 6/1 -0.204***
## (0.000)
## rating: 7/1 -0.238***
## (0.000)
## ---------------------------
## R-squared 0.914
## adj. R-squared 0.914
## sigma 0.022
## F 149953.301
## p 0.000
## Log-likelihood 203812.348
## Deviance 40.728
## AIC -407608.696
## BIC -407533.906
## N 84853
## ===========================
There is a clear relationship between the credit score and the interest rate charged to a borrower. R-squared has a high value of 0.914.
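Because rating enters the model as a categorical predictor with rating 1 as the baseline, the fitted rate for each rating is the intercept plus that rating's coefficient. A short sketch using the coefficients printed above:

```r
# Coefficients copied from the m.rate table above
intercept <- 0.317  # fitted BorrowerRate for rating 1 (HR)
offsets <- c(0, -0.024, -0.071, -0.123, -0.163, -0.204, -0.238)
names(offsets) <- c("HR", "E", "D", "C", "B", "A", "AA")

# Fitted rate per rating = baseline + offset
mean_rate <- intercept + offsets
round(mean_rate, 3)
```

The fitted rate falls at every step, from about 0.317 for HR to about 0.079 for AA.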
## [1] "BorrowerAPR - BorrowerRate"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.01900 0.02502 0.02604 0.03653 0.14930 25
## [1] "BorrowerRate - LenderYield"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01000 0.01000 0.01006 0.01000 0.05500
The difference between BorrowerAPR and BorrowerRate is typically between 1.9% and 3.7% (first to third quartile).
The difference between BorrowerRate and LenderYield, as expected, is around 1.0%
The amount borrowed tends to increase when the rate decreases (better rating).
Higher rates or amounts result in higher monthly payments, which could explain the choice of a longer Term.
The right part of the graphs:
Bigger Loan Amounts require more investors to fund it.
Lower rates mean a better credit rating, and hence more investors get involved. Fewer investors take the risk of funding higher rates (lower credit score), even when they can expect higher returns.
The left part of the graphs:
One or a few investors fund loans of different amounts, but the lower rates suggest that the credit score has some influence.
In an effort to explore which variables are related to or determine the Prosper Rating, I will follow some ideas from this Reader’s Digest article explaining how credit scores in general are calculated. In the case of Prosper’s rating, they also rely on the borrower’s credit score from other rating agencies.
One clear relationship is between the borrowers’ credit scores that come from other agencies (hence the lower and upper limits) and the score or rating assigned by Prosper.
The score change related to the score when previous Prosper loans were granted, has some influence too.
All the above variables show some relationship with the Prosper Score, but the variable that shows more influence is the percentage of Trades Never Delinquent.
The inquiries made into a borrower’s credit history when they apply for new credit influence the credit score. These two variables show some relationship with the Prosper Score.
These two variables show a clear relationship with the Prosper Rating.
4.a Other measurements of Total Credit
Prosper Principal Borrowed, the total principal borrowed on Prosper loans at the time the listing was created, shows some relationship with Prosper Rating. The other 3 seem not to influence it greatly.
This ratio seems to influence the Prosper Rating. The bigger this ratio, the less money is available to pay new debt or daily expenses.
BorrowerRate vs. ProsperRating
The relationship between these two variables is very strong. Borrowers with better rating, get loans with lower interest rates.
Interest Rates
The different types of interest rates are related to one another:
The BorrowerRate is the rate charged to the borrower according to their rating.
The BorrowerAPR is the annualized percentage rate that the borrower pays for the loan. It is higher than the BorrowerRate.
The LenderYield is 1% less than the BorrowerRate, after Prosper’s servicing fee.
These relationships seem confirmed by the data.
LoanOriginalAmount, BorrowerRate, Term, Investors
Higher amounts and higher rates tend to go with a longer term.
Investors, the number of investors involved in funding a loan, seem to have a different behaviour for the range from 1-20 investors, than for more than 20 investors. It seems related to the interest rate, the credit rating and the loan amount.
ProsperRating
This variable is very interesting:
On the front end, it determines what interest rate will be charged for a loan.
On the back end, it seems to be related to (or determined by) many other variables that define the borrower:
ProsperRating..numeric. (from 1, worst, to 7, best) is strongly related to the BorrowerRate. The R-squared value is 0.914.
Having found a strong relationship between the BorrowerRate and the ProsperRating, I want to check other variables that are known at the time the loan is being requested: the term and the loan amount.
I also want to see how the rating influences the other interest rates, not only borrower’s rate, but APR and lender’s yield.
I think it is worth exploring the investors’ view on the matter: how do investors react to ratings and interest rates?
Finally, having identified the variables that contribute to the observed variations in the rating, I’ll try to build a linear regression model that sheds some insight into the main contributors to one’s credit rating.
##
## Calls:
## m.rate: lm(formula = BorrowerRate ~ rating, data = pl.rated)
## m.rate.1: lm(formula = BorrowerRate ~ rating + factor(Term), data = pl.rated)
##
## ============================================
## m.rate m.rate.1
## --------------------------------------------
## (Intercept) 0.317*** 0.283***
## (0.000) (0.001)
## rating: 2/1 -0.024*** -0.025***
## (0.000) (0.000)
## rating: 3/1 -0.071*** -0.073***
## (0.000) (0.000)
## rating: 4/1 -0.123*** -0.127***
## (0.000) (0.000)
## rating: 5/1 -0.163*** -0.166***
## (0.000) (0.000)
## rating: 6/1 -0.204*** -0.206***
## (0.000) (0.000)
## rating: 7/1 -0.238*** -0.238***
## (0.000) (0.000)
## factor(Term): 36/12 0.035***
## (0.001)
## factor(Term): 60/12 0.045***
## (0.001)
## --------------------------------------------
## R-squared 0.914 0.922
## adj. R-squared 0.914 0.922
## sigma 0.022 0.021
## F 149953.301 126057.655
## p 0.000 0.000
## Log-likelihood 203812.348 208257.947
## Deviance 40.728 36.676
## AIC -407608.696 -416495.895
## BIC -407533.906 -416402.408
## N 84853 84853
## ============================================
Taking into consideration the variation observed in the relationship between the Prosper Rating and the Borrower Rate by the Term of the loan, and updating the linear model, the R-squared increases to 0.922
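With two categorical predictors, a fitted rate is the intercept plus one rating coefficient plus one term coefficient. The sketch below uses the m.rate.1 coefficients printed above. `fitted_rate` is a hypothetical helper written for this illustration, not part of the original analysis.

```r
# Coefficients copied from the m.rate.1 column above
intercept <- 0.283  # baseline: rating 1 (HR), 12-month term
rating_eff <- c(0, -0.025, -0.073, -0.127, -0.166, -0.206, -0.238)
term_eff <- c("12" = 0, "36" = 0.035, "60" = 0.045)

# Hypothetical helper: fitted rate for a numeric rating (1-7) and a term in months
fitted_rate <- function(rating, term) {
  intercept + rating_eff[rating] + term_eff[[as.character(term)]]
}

fitted_rate(4, 36)  # rating 4 (C), 36 months
fitted_rate(4, 60)  # rating 4 (C), 60 months
```

For the same rating, the model adds 3.5 percentage points for a 36-month term and 4.5 for a 60-month term, relative to 12 months.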
Another way to see the relationship between a better rating (lower risk) and the interest rate: in these graphs, loans with better ratings clearly sit toward the end of the graph with lower interest rates.
In general, the lower the interest rate, the higher the amount loaned, and this corresponds to a better rating.
## [1] "90 and 95 quantiles of Investors by ProsperRating"
## Source: local data frame [7 x 3]
##
## ProsperRating..numeric. q90 q95
## 1 1 75 86.00
## 2 2 84 108.00
## 3 3 148 184.35
## 4 4 159 207.00
## 5 5 213 275.00
## 6 6 270 344.00
## 7 7 388 451.00
More investors pitch in when the rating is better, and most of the loans funded by one investor are for borrowers with good rating.
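The trend in the quantile table can be checked directly. A small sketch using the values printed above:

```r
# q90 and q95 of Investors by rating, copied from the table above
inv_q <- data.frame(
  rating = 1:7,
  q90 = c(75, 84, 148, 159, 213, 270, 388),
  q95 = c(86, 108, 184.35, 207, 275, 344, 451)
)

# TRUE if both quantiles rise with every step up in rating
rising <- all(diff(inv_q$q90) > 0) && all(diff(inv_q$q95) > 0)
rising

# Ratio between the best (7 = AA) and worst (1 = HR) rating at the 90th percentile
ratio_q90 <- inv_q$q90[7] / inv_q$q90[1]
ratio_q90
```

At the 90th percentile, loans with the best rating attract roughly five times as many investors as those with the worst.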
The second graph is another way of seeing the relationship between BorrowerRate and ProsperRating: horizontal lines define a rate corresponding to a rating, with lower rates for better ratings.
The number of investors seems spread across the whole plot, but the range of investors increases with rating.
##
## Calls:
## fit: lm(formula = ProsperRating..numeric. ~ CreditScoreRangeLower +
## CreditScoreRangeUpper + TradesNeverDelinquent..percentage. +
## TotalTrades + BankcardUtilization + DebtToIncomeRatio + AvailableBankcardCredit +
## InquiriesLast6Months + CurrentDelinquencies - 1, data = pl.rated)
##
## ===============================================
## CreditScoreRangeLower 0.347***
## (0.005)
## CreditScoreRangeUpper -0.333***
## (0.005)
## TradesNeverDelinquent..percentage. 1.072***
## (0.043)
## TotalTrades 0.002***
## (0.000)
## BankcardUtilization -0.379***
## (0.019)
## DebtToIncomeRatio -0.774***
## (0.014)
## AvailableBankcardCredit 0.000***
## (0.000)
## InquiriesLast6Months -0.304***
## (0.003)
## CurrentDelinquencies -0.067***
## (0.004)
## -----------------------------------------------
## R-squared 0.921
## adj. R-squared 0.921
## sigma 1.255
## F 99835.974
## p 0.000
## Log-likelihood -127662.076
## Deviance 122146.360
## AIC 255344.152
## BIC 255436.740
## N 77557
## ===============================================
##
## Call:
## lm(formula = ProsperRating..numeric. ~ CreditScoreRangeLower +
## CreditScoreRangeUpper + TradesNeverDelinquent..percentage. +
## TotalTrades + BankcardUtilization + DebtToIncomeRatio + AvailableBankcardCredit +
## InquiriesLast6Months + CurrentDelinquencies - 1, data = pl.rated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0169 -0.7939 0.1057 0.8651 9.3261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## CreditScoreRangeLower 3.470e-01 4.682e-03 74.122 < 2e-16
## CreditScoreRangeUpper -3.327e-01 4.569e-03 -72.812 < 2e-16
## TradesNeverDelinquent..percentage. 1.072e+00 4.337e-02 24.717 < 2e-16
## TotalTrades 1.930e-03 4.144e-04 4.658 3.19e-06
## BankcardUtilization -3.794e-01 1.857e-02 -20.428 < 2e-16
## DebtToIncomeRatio -7.740e-01 1.431e-02 -54.082 < 2e-16
## AvailableBankcardCredit 1.357e-05 3.036e-07 44.690 < 2e-16
## InquiriesLast6Months -3.036e-01 3.308e-03 -91.769 < 2e-16
## CurrentDelinquencies -6.656e-02 4.262e-03 -15.618 < 2e-16
##
## CreditScoreRangeLower ***
## CreditScoreRangeUpper ***
## TradesNeverDelinquent..percentage. ***
## TotalTrades ***
## BankcardUtilization ***
## DebtToIncomeRatio ***
## AvailableBankcardCredit ***
## InquiriesLast6Months ***
## CurrentDelinquencies ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.255 on 77548 degrees of freedom
## (7296 observations deleted due to missingness)
## Multiple R-squared: 0.9206, Adjusted R-squared: 0.9205
## F-statistic: 9.984e+04 on 9 and 77548 DF, p-value: < 2.2e-16
The Q-Q plot shows the proposed model’s goodness of fit.
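A Q-Q plot of residuals can be produced as in the sketch below. Since the Prosper data is not loaded here, the example fits a stand-in model on R's built-in `cars` data; `fit_demo` is a hypothetical name.

```r
# Stand-in model on R's built-in `cars` data, not the Prosper fit
fit_demo <- lm(dist ~ speed, data = cars)
res <- resid(fit_demo)

# Theoretical vs. sample quantiles, computed without drawing
qq <- qqnorm(res, plot.it = FALSE)

# A correlation close to 1 means the points lie near a straight line
qq_cor <- cor(qq$x, qq$y)
qq_cor
```

Calling `qqnorm(res)` followed by `qqline(res)` draws the plot itself. Points that bend away from the line at the ends indicate heavier or lighter tails than a normal distribution.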
BorrowerRate
The interest rate that the borrower gets for a loan is related to their credit score, in this case the ProsperRating.
When the borrower submits an application for a loan, the other relevant variables are the term, amount and purpose of the loan.
After exploring the relationship between these variables and the interest rate, the only one that adds variation to, or strengthens, the relationship is the term of the loan.
I would not say that the term is a factor determining the interest rate, but perhaps that the relationship observed is because people with better rating choose shorter terms.
ProsperRating
Trying to determine what other variables influence or determine the ProsperRating of a borrower, I explored many variables in the dataset, grouping them in these categories:
Credit scores from other agencies: There are two variables in the dataset that reflect them as a lower and upper credit score. There is also one that shows the change in credit score since the last application for a Prosper loan.
Payment History: There are some variables from outside Prosper showing past delinquencies, old and recent; and other variables from past Prosper loans that reflect on time and late payments.
Available Credit and Total Debt: The amount of credit available through bank credit cards, and the ratio of utilization or indebtment.
New Credit: The variables that record the inquiries made to a borrower account to obtain new credit.
Debt to Income Ratio: One variable holds this information.
After exploring the variation of the ProsperRating with these variables, I grouped them again according to their influence in the rating:
I didn’t expect an interaction between the term of a loan and its interest rate. We can’t say that the term influences the interest rate, but a borrower that is applying for a loan and gets it at a higher rate, would prefer a longer term to pay.
Another surprising interaction was the distribution of the number of investors by rating. But in hindsight, it is reasonable that most investors prefer to minimize the risk, while there are still a number of investors who would tolerate more risk for a higher return.
Having seen that the interest rate assigned to a borrower is strongly related or determined by their credit score, and knowing that Prosper assigns their own credit score to their borrowers, I wanted to explore if there were variables that could explain the origin of such score.
Prosper used a credit score called CreditGrade that was assigned at the time the loan listing went live and was used for loans before July 2009.
After July 2009, Prosper assigns a credit score called ProsperRating. There are two equivalent versions of this score in the dataset: one numeric, ranging from 1 (worst) to 7 (best), and the other with the corresponding levels HR (worst), E, D, C, B, A, and AA (best).
I restricted my exploration only to those loans after July 2009 that use the new ProsperRating.
For the model I chose the following features:
After fitting a linear model with these features on the subset of data that uses the ProsperRating, I obtained an R-squared of 0.92, which indicates a strong relationship.
The model shows that there is a strong relationship between the chosen features and the ProsperRating. The Q-Q plot of the model shows a good fit, with residuals following a normal distribution in the central portion. The model served to identify the variables that influence the most the variability of ProsperRating.
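One caveat may be worth checking here. The model formula ends in `- 1`, which removes the intercept, and for such models R's `summary.lm` measures R-squared against zero rather than against the mean of the response. The resulting value can be much higher than for the same model with an intercept. The sketch below illustrates this on toy data (not the Prosper data), so it only shows the mechanism, not how much it affects the 0.92 reported above.

```r
# Toy data: y hovers around 100 and has almost no linear relation to x
x <- 1:20
y <- 100 + sin(x)

# Same predictor, with and without an intercept
r2_with_intercept <- summary(lm(y ~ x))$r.squared
r2_no_intercept <- summary(lm(y ~ x - 1))$r.squared

c(with_intercept = r2_with_intercept, no_intercept = r2_no_intercept)
```

Refitting the rating model with an intercept and comparing the two R-squared values, together with the residual standard error, would give a fairer picture of the fit.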
I don’t know the weight that Prosper assigns to each variable. For example, the two credit scores that come from other agencies could be combined into one with a weighted mean, assigning more weight to the lower score than to the higher. Or the set of variables that define the borrower’s payment history could be assigned more weight than the inquiries made to their account. In the model I am giving the same weight to every variable.
This plot of the BorrowerRate vs. ProsperRating shows that a lower interest rate is expected for a borrower with a better credit rating. (The rating levels go from 1-high risk, to 7-best credit). When grouping by the loan term in months, the interest rate tends to be lower for shorter terms.
This plot reveals some interesting facts about the Annual Percentage Rate (APR) as compared to the nominal interest rate. The purpose of the APR is to make it easier to compare lenders and loan conditions by including all the other fees and charges that make up the cost of borrowing. Just by exploring the dataset, I can’t know what charges and fees are included in the APR: besides the interest rate, a Prosper borrower pays a service fee of 1%, late payment charges and an upfront origination fee (not included in the APR).
The plot shows that lower interest rates correspond to better credit rating borrowers.
Holding the interest rate constant, say at 0.10, the APR is higher for borrowers with a slightly worse credit rating. Borrowers with a better rating pay less in fees and charges.
This plot of the Loan Amount vs. the Number of Investors that fund a loan exhibits the importance of the credit rating as perceived by a third person: the investor.
In the previous plots it is evident that the borrower’s credit rating poses great influence on the lender’s decision to charge lower or higher interest rates to the borrower.
This plot shows the investor’s take on the matter:
More investors pitch in to fund a loan when the borrower’s credit rating is better.
There are investors willing to fund the whole credit if the borrower’s rating is good.
It seems that the Loan Amount for borrowers with a poor credit rating is capped at around $10K (or is it self-restrained, or forced by the higher interest rates?).
The Prosper Loans dataset contains a wealth of information about 114K loans given between Nov 2005 and Mar 2014.
I first started my exploration by reading the name of the variables and their description when it was not clear enough. By this time I had in mind some questions and ideas that I wanted to explore.
I continued exploring individual variables that I had noticed, like the loan amount, term, interest rates, credit scores, loan status, and loan purposes. At first I leaned towards the relationship between the interest rates and the credit score, but because I found it so evident I decided to give it another thought and read the definition of the variables again. I thought backwards, like a borrower, and decided to explore what would give me a good or bad credit rating and so earn me a lower interest rate.
I wanted to know what influences one’s credit rating. I found an article by Reader’s Digest explaining how the credit score is usually worked out and followed its leads through the dataset. I found many variables in the dataset that describe payment history, total debt, duration, new credit, and types of credit, and tried to adapt the article’s hints to my problem at hand. I continued exploring the relationship between some of these variables and the credit rating, forming an idea of which were the main contributors. I identified some variables that influence the credit rating, and some others that don’t influence it much.
Having explored my first hunch, that the interest rate is influenced by one’s credit rating, I started exploring which variables most influence one’s credit rating. I believe they are:
The linear model shows a good fit, but I think its main limitation is that I don’t know how much importance credit agencies in general, or Prosper in this case, put on each feature. I have assigned the same weight to each feature in the model, but my intuition tells me that in practice more weight is given to some features, in the interest of the lender.
Another limitation, which I explored only briefly, is the fluctuation of interest rates with time. In the period from 2009 to 2015, interest rates in the market in general have been stable and low. I explored the change of interest rates through the quarters of each year in the dataset, but I could not discern any definitive pattern. A more thorough analysis should take into consideration the variability of interest rates with time, and perhaps include a variable with the base interest rate, or a weighted mean of the market’s rates, especially if the market starts to fluctuate sharply.